[Numpy] Fix AWS Batch + Add Docker Support #1302
Codecov Report
@@            Coverage Diff             @@
##           master    #1302      +/-   ##
==========================================
+ Coverage   84.14%   84.30%   +0.15%
==========================================
  Files          42       42
  Lines        6397     6397
==========================================
+ Hits         5383     5393      +10
+ Misses       1014     1004      -10
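The percentages in the report can be reproduced from the hit and line counts. The sketch below is only an illustration of that arithmetic: the helper names are hypothetical, and the choice to truncate (rather than round) to two decimals is an inference from the figures shown, not something the report states.

```python
import math


def coverage_percent(hits: int, lines: int) -> float:
    """Coverage as a percentage of tracked lines that were hit."""
    return 100.0 * hits / lines


def truncate(value: float, digits: int = 2) -> float:
    """Drop digits beyond `digits` decimals (assumed display rule, see lead-in)."""
    factor = 10 ** digits
    return math.floor(value * factor) / factor


# Figures copied from the report above.
LINES = 6397
base = coverage_percent(5383, LINES)  # master
head = coverage_percent(5393, LINES)  # this PR

print(truncate(base), truncate(head), truncate(head - base))
```

Hits plus misses equals the line count in both columns (5383 + 1014 and 5393 + 1004 are both 6397), so the ten newly covered lines account for the whole change.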
same "printed page" as the copyright notice for easier
identification within third-party archives.

Copyright [yyyy] [name of copyright owner]
Let me get back to you on what we should update this to.
RUN wget -c https://www.openssl.org/source/openssl-${OPENSSL_VERSION}.tar.gz \
&& apt-get update \
&& apt remove -y --purge openssl \
&& rm -rf /usr/include/openssl \
&& apt-get install -y \
ca-certificates \
&& tar -xzvf openssl-${OPENSSL_VERSION}.tar.gz \
&& cd openssl-${OPENSSL_VERSION} \
&& ./config && make -j $(nproc) && make test \
&& make install \
&& ldconfig \
&& cd .. && rm -rf openssl-*
Why?
Do you mean why should we install openssl?
Why compile openssl from source?
I guess it's because we may want a customized version of OpenSSL. There may also be security concerns with installing it from another source. This is the approach adopted in DLC: https://github.com/aws/deep-learning-containers/blob/95e4d9c9cba8b6dffec61637452b4bbd46bb59bd/mxnet/training/docker/1.6.0/py3/cu101/Dockerfile.gpu#L113-L124
Ubuntu manages the security fixes for you. No need to compile from source. I recommend you remove this part.
@@ -147,10 +148,10 @@ def main():
sys.exit(status == 'FAILED')

elif status == 'RUNNING':
- logStreamName = getLogStream(logGroupName, jobName, jobId)
+ logStreamName = describeJobsResponse['jobs'][0]['container']['logStreamName']
Good Job!
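For readers unfamiliar with the change being praised: the AWS Batch DescribeJobs response carries the CloudWatch log stream name under each job's container entry, so it can be read directly instead of being looked up separately. A minimal sketch of a defensive version of that lookup follows. The function name and the sample response are hypothetical and not taken from submit-job.py; in practice the dictionary would come from boto3's Batch client `describe_jobs(jobs=[jobId])`.

```python
from typing import Any, Dict, Optional


def extract_log_stream_name(describe_jobs_response: Dict[str, Any]) -> Optional[str]:
    """Return the log stream name of the first job, or None if it is not set yet."""
    jobs = describe_jobs_response.get("jobs", [])
    if not jobs:
        return None
    # The container block may lack logStreamName before the job is RUNNING.
    return jobs[0].get("container", {}).get("logStreamName")


# Hypothetical, trimmed response used only for illustration.
sample_response = {
    "jobs": [
        {
            "jobId": "example-job-id",
            "status": "RUNNING",
            "container": {
                "logStreamName": "example-job-definition/default/example-task-id"
            },
        }
    ]
}

print(extract_log_stream_name(sample_response))
```

Returning None rather than raising a KeyError lets a polling loop simply retry until the stream name appears.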
Add LICESE + Examples for batch Update docker image update Update README.md Update README.md Update ubuntu18.04-devel.Dockerfile Update ubuntu18.04-devel.Dockerfile Update ubuntu18.04-devel.Dockerfile update Update ubuntu18.04-devel-gpu.Dockerfile fix Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile update Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile update update Update submit-job.py Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile try to fix fix batch Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile simplify bert test add files Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile fix Update ubuntu18.04-devel-gpu.Dockerfile
@leezu I tried to compile with the latest MXNet and install horovod via Haibin's branch. However, I'm seeing this error message:
Thus, I reverted to using the MXNet wheel package and commented out the Horovod-related code. Would you approve this if you think it is appropriate?
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team in GitHub issues/pull requests
by mentioning @dmlc/gluon-nlp-committers. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.
Recommending that people open a GitHub issue may not provide the confidentiality you promise here.
Should we create an issue and revise it in a later PR? Alternatively, I can remove this CODE_OF_CONDUCT for now.
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*

# Install CMake 3.13.3. The default in Ubuntu 18.04 is cmake 3.10.2
pip install cmake
will be easier ;)
Horovod should have been added to the Dockerfile.
I can confirm that the Docker image builds and has been uploaded to Docker Hub, and that all MXNet-related unit tests in Horovod pass.
commit d8b68c6 Author: Xingjian Shi <[email protected]> Date: Thu Aug 20 08:47:56 2020 -0700 [Numpy] Fix AWS Batch + Add Docker Support (dmlc#1302) * Update submit-job.py Add LICESE + Examples for batch Update docker image update Update README.md Update README.md Update ubuntu18.04-devel.Dockerfile Update ubuntu18.04-devel.Dockerfile Update ubuntu18.04-devel.Dockerfile update Update ubuntu18.04-devel-gpu.Dockerfile fix Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile update Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile update update Update submit-job.py Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile try to fix fix batch Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile simplify bert test add files Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile fix Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * try to add back mxnet support * Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * update * Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * fix issues * update commit 6ae558e Author: ht <[email protected]> Date: Thu Aug 20 23:47:30 2020 +0800 [FEATURE]Horovod 
support for training transformer (PART 2) (dmlc#1301) * set default shuffle=True for boundedbudgetsampler * fix * fix log condition * use horovod to train transformer * fix * add mirror wmt dataset * fix * rename wmt.txt to wmt.json and remove part of urls * fix * tuning params * use get_repo_url() * update average checkpoint cli * paste result of transformer large * fix * fix logging in train_transformer * fix * fix * fix * add transformer base config * fix * change to wmt14/full * print more sacrebleu info * fix * add test for num_parts and update behavior of boundedbudgetsampler with even_size * fix * fix * fix * fix logging when using horovd * udpate doc of train transformer * add test case for fail downloading * add a ShardedIterator * fix * fix * fix * change mpirun to horovodrun * make the horovod command complete * use print(sampler) to cover the codes of __repr__ func * empty commit * add test case test_sharded_iterator_even_size Co-authored-by: Hu <[email protected]>
commit 7525618 Author: ZheyuYe <[email protected]> Date: Fri Aug 21 11:25:38 2020 +0800 Squashed commit of the following: commit d8b68c6 Author: Xingjian Shi <[email protected]> Date: Thu Aug 20 08:47:56 2020 -0700 [Numpy] Fix AWS Batch + Add Docker Support (dmlc#1302) * Update submit-job.py Add LICESE + Examples for batch Update docker image update Update README.md Update README.md Update ubuntu18.04-devel.Dockerfile Update ubuntu18.04-devel.Dockerfile Update ubuntu18.04-devel.Dockerfile update Update ubuntu18.04-devel-gpu.Dockerfile fix Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile update Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile update update Update submit-job.py Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile try to fix fix batch Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile simplify bert test add files Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile fix Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * try to add back mxnet support * Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * update * Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * 
fix issues * update commit 6ae558e Author: ht <[email protected]> Date: Thu Aug 20 23:47:30 2020 +0800 [FEATURE]Horovod support for training transformer (PART 2) (dmlc#1301) * set default shuffle=True for boundedbudgetsampler * fix * fix log condition * use horovod to train transformer * fix * add mirror wmt dataset * fix * rename wmt.txt to wmt.json and remove part of urls * fix * tuning params * use get_repo_url() * update average checkpoint cli * paste result of transformer large * fix * fix logging in train_transformer * fix * fix * fix * add transformer base config * fix * change to wmt14/full * print more sacrebleu info * fix * add test for num_parts and update behavior of boundedbudgetsampler with even_size * fix * fix * fix * fix logging when using horovd * udpate doc of train transformer * add test case for fail downloading * add a ShardedIterator * fix * fix * fix * change mpirun to horovodrun * make the horovod command complete * use print(sampler) to cover the codes of __repr__ func * empty commit * add test case test_sharded_iterator_even_size Co-authored-by: Hu <[email protected]> commit 1403c6e Author: ZheyuYe <[email protected]> Date: Fri Aug 21 11:15:44 2020 +0800 update uncased_bert_large commit 733a4b6 Author: ZheyuYe <[email protected]> Date: Thu Aug 20 20:16:39 2020 +0800 adjust uncased_bert_large commit 770f079 Author: ZheyuYe <[email protected]> Date: Thu Aug 20 15:10:57 2020 +0800 Revert "merge xingjian's" This reverts commit ea1f1aa. 
commit fe74dda Author: ZheyuYe <[email protected]> Date: Thu Aug 20 14:07:36 2020 +0800 update electra small commit 8972343 Author: ZheyuYe <[email protected]> Date: Thu Aug 20 14:00:57 2020 +0800 add command to readme commit 8fcde49 Author: ZheyuYe <[email protected]> Date: Thu Aug 20 12:30:47 2020 +0800 revise commit 7a625c4 Author: ZheyuYe <[email protected]> Date: Thu Aug 20 12:21:58 2020 +0800 update reamde commit 071c6dd Author: ZheyuYe <[email protected]> Date: Wed Aug 19 17:14:53 2020 +0800 update bert squad command commit ea1f1aa Author: ZheyuYe <[email protected]> Date: Tue Aug 18 18:07:01 2020 +0800 merge xingjian's commit 859ab4d Author: ZheyuYe <[email protected]> Date: Tue Aug 18 17:47:01 2020 +0800 dummy example commit 633e683 Author: ZheyuYe <[email protected]> Date: Tue Aug 18 17:36:31 2020 +0800 list_backbone_names commit b4aac59 Author: ZheyuYe <[email protected]> Date: Tue Aug 18 17:32:51 2020 +0800 update readme commit 54301d9 Author: ZheyuYe <[email protected]> Date: Tue Aug 18 13:59:06 2020 +0800 revise batch squad commit e019e27 Author: ZheyuYe <[email protected]> Date: Tue Aug 18 13:58:49 2020 +0800 bash convert commit e01eda0 Author: ZheyuYe <[email protected]> Date: Tue Aug 18 11:10:51 2020 +0800 update roberta commit 1730ff7 Author: ZheyuYe <[email protected]> Date: Tue Aug 18 10:15:27 2020 +0800 revise submit commit de0b4c9 Author: ZheyuYe <[email protected]> Date: Mon Aug 17 16:07:58 2020 +0800 upload batch files commit 175de01 Author: ZheyuYe <[email protected]> Date: Mon Aug 17 16:05:02 2020 +0800 fix commit 0460ed3 Author: ZheyuYe <[email protected]> Date: Mon Aug 17 15:48:52 2020 +0800 upload commands
Should solve #1243 and #1139